CA5 Phase2

The purpose of this project is to build a neural network with the TensorFlow and Keras libraries, train it on the given training dataset, and then evaluate it on the given test dataset.

Imports

Body

Part 1: Data Analysis and Preprocessing

As seen above, the labels are one-hot encoded.

Our classes have no inherent order or rank, but label encoding assigns each class a numeric rank. This makes it likely that the model will capture a spurious relationship between classes based on those ranks, so we use one-hot encoding to avoid this problem.
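A small sketch of the difference, using pandas. The class names here are hypothetical; the real dataset's classes are not shown in this section:

```python
import pandas as pd

# Hypothetical class labels standing in for the dataset's real classes.
labels = pd.Series(["cat", "dog", "bird", "dog"])

# Label encoding imposes an artificial order: bird -> 0, cat -> 1, dog -> 2.
codes = labels.astype("category").cat.codes

# One-hot encoding removes that order: each class gets its own 0/1 column.
one_hot = pd.get_dummies(labels)
print(one_hot.columns.tolist())  # ['bird', 'cat', 'dog']
```

With one-hot columns, no class is numerically "greater" than another, so the model cannot rank them.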

Shuffle the dataframe before splitting it into training and validation sets:
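A minimal sketch of the shuffle-then-split step, assuming an 80/20 split; the toy dataframe and column names are placeholders for the notebook's real data:

```python
import pandas as pd

# Toy frame standing in for the training dataframe.
df = pd.DataFrame({"x": range(10), "label": [0, 1] * 5})

# Shuffle all rows (frac=1 samples the whole frame), then split 80/20.
shuffled = df.sample(frac=1, random_state=42).reset_index(drop=True)
split = int(0.8 * len(shuffled))
train_df, val_df = shuffled.iloc[:split], shuffled.iloc[split:]
```

Shuffling first matters because the rows may be ordered by class; without it, the validation split could contain only the last classes in the file.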

Part 2: Make Neural Network

For a dense layer, the number of weight parameters is the product of the number of nodes in the two consecutive layers, plus one bias parameter per node in the current layer.
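The count can be checked by hand; the layer sizes below are hypothetical, but the same formula is what Keras's `model.count_params()` reports:

```python
# Parameters of a fully connected layer: (inputs * units) weights + units biases.
def dense_params(n_in: int, n_out: int) -> int:
    return n_in * n_out + n_out

# Hypothetical architecture: 4 inputs -> 8 hidden units -> 3 outputs.
total = dense_params(4, 8) + dense_params(8, 3)
print(total)  # 67
```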

Part 3: Data Classification

Section 1: Optimizer Effect

Momentum is an extension to the gradient descent optimization algorithm, often referred to as gradient descent with momentum.

It is designed to accelerate the optimization process, e.g. decrease the number of function evaluations required to reach the optima, or to improve the capability of the optimization algorithm, e.g. result in a better final result.

Momentum involves adding an additional hyperparameter that controls the amount of history (momentum) to include in the update equation, i.e. the step to a new point in the search space. The value for the hyperparameter is defined in the range 0.0 to 1.0 and often has a value close to 1.0, such as 0.8, 0.9, or 0.99. A momentum of 0.0 is the same as gradient descent without momentum.

Momentum is most useful in optimization problems where the objective function has a large amount of curvature (e.g. changes a lot), meaning that the gradient may change a lot over relatively small regions of the search space.

It is also helpful when the gradient is estimated, such as from a simulation, and may be noisy, e.g. when the gradient has a high variance.

Finally, momentum is helpful when the search space is flat or nearly flat, e.g. zero gradient. The momentum allows the search to progress in the same direction as before the flat spot and helpfully cross the flat region.

Reference: https://machinelearningmastery.com/gradient-descent-with-momentum-from-scratch/
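The update rule described above can be sketched in plain Python on the toy objective f(x) = x², using the common form v = momentum·v − lr·grad, x = x + v. The learning rate, start point, and step count here are arbitrary choices, not values from this notebook:

```python
# Gradient descent with momentum on f(x) = x**2.
def minimize(momentum: float, lr: float = 0.1, steps: int = 200) -> float:
    x, v = 5.0, 0.0
    for _ in range(steps):
        grad = 2 * x                  # derivative of x**2
        v = momentum * v - lr * grad  # keep a fraction of the previous step
        x += v
    return x

# momentum = 0.0 reduces to plain gradient descent.
print(abs(minimize(0.0)), abs(minimize(0.9)))
```

Both settings reach the minimum at x = 0 here; the difference momentum makes shows up on curved, noisy, or flat objectives, as described above.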

Model With Momentum = 0.5

Model With Momentum = 0.9

No. A large momentum (e.g. 0.9) means the update is strongly influenced by previous updates, whereas a modest momentum (e.g. 0.2) means very little influence. So increasing the momentum does not always improve the result.
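The two optimizers compared above can be built like this in Keras; the learning rate is a placeholder, not necessarily the value used in the notebook:

```python
from tensorflow import keras

# SGD with the two momentum values compared in this section.
sgd_low = keras.optimizers.SGD(learning_rate=0.01, momentum=0.5)
sgd_high = keras.optimizers.SGD(learning_rate=0.01, momentum=0.9)

# Then, for example:
# model.compile(optimizer=sgd_high, loss="categorical_crossentropy",
#               metrics=["accuracy"])
```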

Model with Adam

Compare Adam and SGD:

Adam learns faster, so 10 epochs are enough, and it achieves better accuracy and better results than SGD (especially on the training data).
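Swapping SGD for Adam is a one-line change at compile time; the learning rate below is Adam's Keras default, assumed rather than taken from the notebook:

```python
from tensorflow import keras

# Adam adapts a per-parameter learning rate, which is why it typically
# converges in fewer epochs than plain SGD on this kind of task.
adam = keras.optimizers.Adam(learning_rate=0.001)

# model.compile(optimizer=adam, loss="categorical_crossentropy",
#               metrics=["accuracy"])
```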

Section 2: Epoch Effect

1: One epoch is not enough to learn proper values for all the weights, because they are initialized randomly and the network needs to visit the training data more than once to learn enough.

If the training set is large enough, one epoch may suffice.

2 : Overfitting

When the model overfits, accuracy on the validation data decreases.

3: Early Stopping in Keras
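A minimal sketch of early stopping in Keras; monitoring validation loss and a patience of 3 are placeholder choices:

```python
from tensorflow import keras

# Stop training when validation loss has not improved for `patience`
# epochs, and roll back to the best weights seen so far.
early_stop = keras.callbacks.EarlyStopping(
    monitor="val_loss",
    patience=3,
    restore_best_weights=True,
)

# model.fit(x_train, y_train, validation_data=(x_val, y_val),
#           epochs=50, callbacks=[early_stop])
```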

4: No, it causes overfitting.

We can mitigate this by: 1. training the network on more examples; 2. reducing the complexity of the network.

Section 3: Loss Function Effect

Compare MSE with Categorical Cross Entropy:

MSE:

Categorical Cross Entropy:

MSE:

Categorical Cross Entropy:

The results with categorical cross entropy are better than with MSE.

2:

First, using MSE means that we assume the underlying data has been generated from a normal distribution (a bell-shaped curve); in Bayesian terms this means we assume a Gaussian prior. In reality, a dataset that can be classified into two categories (i.e. binary) follows a Bernoulli distribution, not a normal one.

Secondly, the MSE function is non-convex for binary classification. In simple terms, if a binary classification model is trained with the MSE cost function, it is not guaranteed to minimize that cost function. This is because the MSE function expects real-valued inputs in (-∞, ∞), while binary classification models output probabilities in (0, 1) through the sigmoid/logistic function.

MSE is a good choice for a Cost function when we are doing Linear Regression.
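As a toy illustration of the difference, compare how the two losses punish a confident wrong prediction; the three-class one-hot setup below is hypothetical, not taken from the dataset:

```python
import math

# True class is index 0 of a 3-class one-hot target; the model gives it
# probability p and splits the rest evenly over the other two classes.
def mse(p: float) -> float:
    probs = [p, (1 - p) / 2, (1 - p) / 2]
    target = [1.0, 0.0, 0.0]
    return sum((t - q) ** 2 for t, q in zip(target, probs)) / 3

def cce(p: float) -> float:
    # Categorical cross entropy only looks at the true class's probability.
    return -math.log(p)

# A confident wrong prediction (p = 0.01) is punished far more heavily by
# cross entropy than by MSE, giving a much stronger learning signal.
print(mse(0.01), cce(0.01))
```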

Section 4: Regularization Effect

1 : L2

Regularization is a technique for preventing overfitting by penalizing a model for having large weights. So when we use L2 regularization on each layer, less overfitting occurs.
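A sketch of adding an L2 penalty per layer in Keras; the layer sizes and the factor 0.01 are placeholders for whatever the notebook actually used:

```python
from tensorflow import keras
from tensorflow.keras import layers, regularizers

# Each Dense layer gets an L2 penalty on its weight matrix, which is
# added to the loss and pushes the weights toward smaller values.
model = keras.Sequential([
    layers.Input(shape=(32,)),
    layers.Dense(64, activation="relu",
                 kernel_regularizer=regularizers.l2(0.01)),
    layers.Dense(10, activation="softmax",
                 kernel_regularizer=regularizers.l2(0.01)),
])
```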

2: Dropout

Dropout is a technique where randomly selected neurons are ignored during training. They are “dropped out” randomly. This means that their contribution to the activation of downstream neurons is temporarily removed on the forward pass, and any weight updates are not applied to the neuron on the backward pass.

The effect is that the network becomes less sensitive to the specific weights of neurons. This in turn results in a network that is capable of better generalization and is less likely to overfit the training data.

So, when we use Dropout Regularization for each layer, less overfitting occurs.

Reference: https://machinelearningmastery.com/dropout-regularization-deep-learning-models-keras/
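A sketch of dropout in Keras; the layer sizes and the rate 0.3 are placeholders, not the notebook's actual values:

```python
from tensorflow import keras
from tensorflow.keras import layers

# Dropout after the hidden layer zeroes a random 30% of its activations
# on each training step; at inference time it is a no-op.
model = keras.Sequential([
    layers.Input(shape=(32,)),
    layers.Dense(64, activation="relu"),
    layers.Dropout(0.3),
    layers.Dense(10, activation="softmax"),
])
```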

Test Best Model with the Test Dataset:

Load Test Data:

As seen above, the labels are one-hot encoded using the class encoding from the training data.
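One way to guarantee the test labels use the training encoding is to cast them to a categorical type whose categories come from the training classes; the class names below are hypothetical:

```python
import pandas as pd

# Classes as they were encoded on the training data (hypothetical names).
train_classes = ["bird", "cat", "dog"]

# Encode test labels with the *training* category order, so the one-hot
# columns line up with the model's output units even if a class is missing.
test_labels = pd.Series(["dog", "cat"])
test_one_hot = pd.get_dummies(
    test_labels.astype(pd.CategoricalDtype(categories=train_classes))
)
print(test_one_hot.columns.tolist())  # ['bird', 'cat', 'dog']
```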

Predict with Best Model:

According to the above results, model_10 (model with Dropout layer and 20 epochs) is the best.

Similarities between animals such as the Bald Eagle and the Raven may lead to misclassification.

Another cause may be overfitting in the model.

Another cause may be similarities in the backgrounds of the images: the neural network may learn background features instead of the animals themselves.